{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NPA: Neural News Recommendation with Personalized Attention\n",
    "NPA \\[1\\] is a news recommendation model with personalized attention. The core of NPA is a news representation model and a user representation model. In the news representation model we use a CNN network to learn hidden representations of news articles based on their titles. In the user representation model we learn the representations of users based on the representations of their clicked news articles. In addition, a word-level and a news-level personalized attention are used to capture different informativeness for different users.\n",
    "\n",
    "## Properties of NPA:\n",
    "- NPA is a content-based news recommendation method.\n",
    "- It uses a CNN network to learn news representation. And it learns user representations from their clicked news articles.\n",
    "- A word-level personalized attention is used to help NPA attend to important words for different users.\n",
    "- A news-level personalized attention is used to help NPA attend to important historical clicked news for different users.\n",
    "\n",
    "## Data format:\n",
    "For quicker training and evaluaiton, we sample MINDdemo dataset of 5k users from [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to try experiments on MINDsmall\n",
    " and MINDlarge, please change the dowload source.\n",
    " Select the MIND_type parameter from ['large', 'small', 'demo'] to choose dataset.\n",
    " \n",
    "**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data are composed of a news file and a behaviors file. You can find more detailed data description in [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)\n",
    "\n",
    "### news data\n",
    "This file contains news information including newsid, category, subcatgory, news title, news abstarct, news url and entities in news title, entities in news abstarct.\n",
    "One simple example: <br>\n",
    "\n",
    "`N46466\tlifestyle\tlifestyleroyals\tThe Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By\tShop the notebooks, jackets, and more that the royals can't live without.\thttps://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata\t[{\"Label\": \"Prince Philip, Duke of Edinburgh\", \"Type\": \"P\", \"WikidataId\": \"Q80976\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [48], \"SurfaceForms\": [\"Prince Philip\"]}, {\"Label\": \"Charles, Prince of Wales\", \"Type\": \"P\", \"WikidataId\": \"Q43274\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [28], \"SurfaceForms\": [\"Prince Charles\"]}, {\"Label\": \"Elizabeth II\", \"Type\": \"P\", \"WikidataId\": \"Q9682\", \"Confidence\": 0.97, \"OccurrenceOffsets\": [11], \"SurfaceForms\": [\"Queen Elizabeth\"]}]\t[]`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents information of one piece of news: <br>\n",
    "\n",
    "`[News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract] ...`\n",
    "\n",
    "<br>\n",
    "\n",
    "We generate a word_dict file to tranform words in news title to word indexes, and a embedding matrix is initted from pretrained glove embeddings.\n",
    "\n",
    "### behaviors data\n",
    "One simple example: <br>\n",
    "`1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621 N12917 N4574 N12140 N9748\tN13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one instance of an impression. The format is like: <br>\n",
    "\n",
    "`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`\n",
    "\n",
    "<br>\n",
    "\n",
    "User Click History is the user historical clicked news before Impression Time. Impression News is the displayed news in an impression, which format is:<br>\n",
    "\n",
    "`[News ID 1]-[label1] ... [News ID n]-[labeln]`\n",
    "\n",
    "<br>\n",
    "Label represents whether the news is clicked by the user. All information of news in User Click History and Impression News can be found in news data file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Global settings and imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:09:42) \n",
      "[GCC 7.5.0]\n",
      "Tensorflow version: 1.15.2\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "import os\n",
    "import numpy as np\n",
    "import zipfile\n",
    "from tqdm import tqdm\n",
    "import scrapbook as sb\n",
    "from tempfile import TemporaryDirectory\n",
    "import tensorflow as tf\n",
    "tf.get_logger().setLevel('ERROR') # only show error messages\n",
    "\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import download_deeprec_resources \n",
    "from reco_utils.recommender.newsrec.newsrec_utils import prepare_hparams\n",
    "from reco_utils.recommender.newsrec.models.npa import NPAModel\n",
    "from reco_utils.recommender.newsrec.io.mind_iterator import MINDIterator\n",
    "from reco_utils.recommender.newsrec.newsrec_utils import get_mind_data_set\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Tensorflow version: {}\".format(tf.__version__))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "epochs = 5\n",
    "seed = 42\n",
    "batch_size = 32\n",
    "\n",
    "# Options: demo, small, large\n",
    "MIND_type = 'demo'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 17.0k/17.0k [00:01<00:00, 9.66kKB/s]\n",
      "100%|██████████| 9.84k/9.84k [00:01<00:00, 9.01kKB/s]\n",
      "100%|██████████| 95.0k/95.0k [00:09<00:00, 10.0kKB/s]\n"
     ]
    }
   ],
   "source": [
    "tmpdir = TemporaryDirectory()\n",
    "data_path = tmpdir.name\n",
    "\n",
    "train_news_file = os.path.join(data_path, 'train', r'news.tsv')\n",
    "train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')\n",
    "valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')\n",
    "valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')\n",
    "wordEmb_file = os.path.join(data_path, \"utils\", \"embedding.npy\")\n",
    "userDict_file = os.path.join(data_path, \"utils\", \"uid2index.pkl\")\n",
    "wordDict_file = os.path.join(data_path, \"utils\", \"word_dict.pkl\")\n",
    "yaml_file = os.path.join(data_path, \"utils\", r'npa.yaml')\n",
    "\n",
    "mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)\n",
    "\n",
    "if not os.path.exists(train_news_file):\n",
    "    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)\n",
    "    \n",
    "if not os.path.exists(valid_news_file):\n",
    "    download_deeprec_resources(mind_url, \\\n",
    "                               os.path.join(data_path, 'valid'), mind_dev_dataset)\n",
    "if not os.path.exists(yaml_file):\n",
    "    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \\\n",
    "                               os.path.join(data_path, 'utils'), mind_utils)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data_format=news,iterator_type=None,support_quick_scoring=False,wordEmb_file=/tmp/tmpump0ai7m/utils/embedding.npy,wordDict_file=/tmp/tmpump0ai7m/utils/word_dict.pkl,userDict_file=/tmp/tmpump0ai7m/utils/uid2index.pkl,vertDict_file=None,subvertDict_file=None,title_size=10,body_size=None,word_emb_dim=300,word_size=None,user_num=None,vert_num=None,subvert_num=None,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=4,head_dim=100,cnn_activation=relu,dense_activation=None,filter_num=400,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=100,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=5,batch_size=32,show_step=100000,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']\n"
     ]
    }
   ],
   "source": [
    "hparams = prepare_hparams(yaml_file, \n",
    "                          wordEmb_file=wordEmb_file,\n",
    "                          wordDict_file=wordDict_file, \n",
    "                          userDict_file=userDict_file,\n",
    "                          batch_size=batch_size,\n",
    "                          epochs=epochs)\n",
    "print(hparams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "iterator = MINDIterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the NPA model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = NPAModel(hparams, iterator, seed=seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "8874it [01:15, 117.03it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.5228, 'mean_mrr': 0.2328, 'ndcg@5': 0.2377, 'ndcg@10': 0.303}\n"
     ]
    }
   ],
   "source": [
    "print(model.run_eval(valid_news_file, valid_behaviors_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [00:34, 31.34it/s]\n",
      "8874it [01:13, 119.95it/s]\n",
      "4it [00:00, 35.00it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 1\n",
      "train info: logloss loss:1.5035385669485202\n",
      "eval info: group_auc:0.5867, mean_mrr:0.2555, ndcg@10:0.3432, ndcg@5:0.2778\n",
      "at epoch 1 , train time: 34.7 eval time: 84.3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [00:30, 35.30it/s]\n",
      "8874it [01:13, 120.34it/s]\n",
      "4it [00:00, 35.45it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 2\n",
      "train info: logloss loss:1.4050482461026579\n",
      "eval info: group_auc:0.5996, mean_mrr:0.2706, ndcg@10:0.3589, ndcg@5:0.2967\n",
      "at epoch 2 , train time: 30.8 eval time: 83.1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [00:31, 34.98it/s]\n",
      "8874it [01:13, 119.93it/s]\n",
      "4it [00:00, 35.09it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 3\n",
      "train info: logloss loss:1.3529314152333838\n",
      "eval info: group_auc:0.5992, mean_mrr:0.275, ndcg@10:0.3646, ndcg@5:0.3005\n",
      "at epoch 3 , train time: 31.0 eval time: 83.4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [00:30, 35.45it/s]\n",
      "8874it [01:14, 119.89it/s]\n",
      "4it [00:00, 35.11it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 4\n",
      "train info: logloss loss:1.3024913867314656\n",
      "eval info: group_auc:0.5942, mean_mrr:0.2695, ndcg@10:0.3563, ndcg@5:0.2924\n",
      "at epoch 4 , train time: 30.6 eval time: 83.4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [00:30, 35.38it/s]\n",
      "8874it [01:14, 119.70it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 5\n",
      "train info: logloss loss:1.2650439398394104\n",
      "eval info: group_auc:0.6005, mean_mrr:0.2711, ndcg@10:0.3586, ndcg@5:0.2936\n",
      "at epoch 5 , train time: 30.7 eval time: 83.7\n",
      "CPU times: user 13min 25s, sys: 59.5 s, total: 14min 25s\n",
      "Wall time: 9min 35s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<reco_utils.recommender.newsrec.models.npa.NPAModel at 0x7f92704b0f98>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "8874it [01:14, 119.59it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.6005, 'mean_mrr': 0.2711, 'ndcg@5': 0.2936, 'ndcg@10': 0.3586}\n",
      "CPU times: user 2min 5s, sys: 9.68 s, total: 2min 14s\n",
      "Wall time: 1min 24s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "res_syn = model.run_eval(valid_news_file, valid_behaviors_file)\n",
    "print(res_syn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sb.glue(\"res_syn\", res_syn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = os.path.join(data_path, \"model\")\n",
    "os.makedirs(model_path, exist_ok=True)\n",
    "\n",
    "model.model.save_weights(os.path.join(model_path, \"npa_ckpt\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output Predcition File\n",
    "This code segment is used to generate the prediction.zip file, which is in the same format in [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).\n",
    "\n",
    "Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "8874it [01:14, 119.45it/s]\n"
     ]
    }
   ],
   "source": [
    "group_impr_indexes, group_labels, group_preds = model.run_slow_eval(valid_news_file, valid_behaviors_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "7538it [00:00, 23050.36it/s]\n"
     ]
    }
   ],
   "source": [
    "with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:\n",
    "    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):\n",
    "        impr_index += 1\n",
    "        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()\n",
    "        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'\n",
    "        f.write(' '.join([str(impr_index), pred_rank])+ '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)\n",
    "f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "\\[1\\] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie: NPA: Neural News Recommendation with Personalized Attention, KDD 2019, ADS track.<br>\n",
    "\\[2\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>\n",
    "\\[3\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco_gpu)",
   "language": "python",
   "name": "reco_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
